Abstract. Over a decade ago, the H1 Collaboration decided to embrace the object-oriented
paradigm and completely redesign its data analysis model and data storage format. The event
data model, based on the ROOT framework, consists of three layers - tracks and calorimeter
clusters, identified particles and finally event summary data - with a singleton class providing
unified access. This original solution was then augmented with a fourth layer containing
user-defined objects.

This contribution will summarise the history of the solutions used, from modifications to the 
original design, to the evolution of the high-level end-user analysis object framework which is 
used by H1 today. Several important issues are addressed - the portability of expert knowledge
to increase the efficiency of data analysis, the flexibility of the framework to incorporate new 
analyses, the performance and ease of use, and lessons learned for future projects. 



1. Introduction 

The H1 experiment analysed electron-proton data from the HERA machine at DESY, Hamburg.
During the luminosity upgrade at the turn of the millennium, H1 decided to change their analysis
framework to an object-oriented paradigm, concentrating on the data storage model [1, 2, 3, 4].

2. The Analysis Models 

2.1. The old analysis model 

Figure [1] shows the old data analysis model of H1. The Data Summary Tape (DST) Fortran data
storage format was processed by privately maintained ntuple production code to produce the analysis
data format. This was usually the HBOOK format, and physics analysis was performed using
PAW. The scope for sharing analysis information was fairly limited, as the variable definitions 
depended on the privately maintained ntuple production code which varied from group to group. 
Meanwhile, the private production of ntuples was usually quite inefficient, resulting in many 
copies of essentially the same data, wasting resources. 

2.2. The H1 OO data analysis model

Figure [2] shows the object-oriented data analysis model. The aim was to centrally produce 
the analysis data storage format, replacing the custom ntuples used previously. Private ntuple 
production was not prohibited and was used by some individuals, but the vast majority of the 
collaboration enjoyed and still enjoys central production of their analysis data format. 
Figure 1. The old data analysis model.
Figure 2. The new data storage model of H1.



2.3. The H1 OO data storage model

The DST remains the source of all derived analysis formats. The object-oriented data format 
consists of several layers, and all data storage layers are encapsulated in the H1Tree interface.
This object provides smart access to the data, such that the user does not need to know which
layer is accessed. 
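
The layer-transparent access described above can be illustrated with a short sketch. It is written in Python purely for brevity (the H1 software itself is based on ROOT), and the class name LayeredTree, its methods and the variable names are hypothetical, not taken from the H1 OO code. It assumes that each layer can be modelled as a simple name-to-value mapping and that the layers are searched in a fixed order.

```python
class LayeredTree:
    """Toy facade over several storage layers (hypothetical, not the H1Tree API)."""

    def __init__(self, layers):
        # layers: mapping of layer name -> {variable name: value};
        # insertion order defines the lookup order.
        self._layers = dict(layers)

    def layer_of(self, name):
        """Return the name of the first layer that holds `name`."""
        for layer_name, content in self._layers.items():
            if name in content:
                return layer_name
        raise KeyError(f"variable {name!r} not found in any layer")

    def get(self, name):
        """Return a value without the caller naming the layer."""
        return self._layers[self.layer_of(name)][name]


# Invented example content for the three layers described in the text.
tree = LayeredTree({
    "HAT": {"q2": 25.0},
    "MODS": {"n_particles": 3},
    "ODS": {"n_tracks": 7},
})
q2 = tree.get("q2")              # found in the event summary layer
n_tracks = tree.get("n_tracks")  # falls through to the ODS layer
```

User code in this sketch only ever calls get(); which layer answered is an implementation detail that can still be queried with layer_of() for debugging.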

The Object Data Storage (ODS) layer is entirely equivalent to the DST, but now Track and 
Cluster classes are used for the data representation. Object-oriented bank classes also exist, 
providing DST, POT and RAW data access if and as necessary. The ODS was generally not 
used for analysis, as the content of the two derived layers described next was usually found to 
be sufficient. However, it can be produced and stored if necessary (if the ODS content will be 
repeatedly analysed), or created on-the-fly from DST (more performant for rare access). 

Particle finders, shown in Figure [3], are used to produce the content stored in the next
object-oriented layer, the micro-ODS or MODS. The particle-finding algorithms used represent,
by definition, the best knowledge of H1, and thus all analyses use this best knowledge. Event summary
information is stored in the third layer, the H1 Analysis Tag or HAT. Together, the MODS
and HAT layers are the analysis data format, centrally produced, for the vast majority of H1
analyses.

A fourth and final optional layer, the UserTree, provides ultimate flexibility by allowing a
user-defined storage layer. If this layer was found to be useful for several groups, it entered
the central production framework and was also centrally maintained. It is worth noting that
there are currently three UserTree packages in the core H1OO framework, suggesting that this
flexibility was both necessary and useful.
Figure 3. The Particle Finder model.
Figure 4. The transient data interface, H1Calculator.



3. Transient Data 

The original H1 OO concept concentrated on unifying the data analysis format and optimising
resource usage for its production, and this was very successful. However, there were one or two
limitations to this model. Chief among these was the lack of treatment of transient data, i.e. 
quantities calculated or re-calculated at run time. This was found to be particularly relevant 
for the evaluation of systematic uncertainties, where derived quantities need to be re-calculated 
based on a systematically shifted base quantity. It was also relevant when analysts wanted to 
apply the latest calibrations on-the-fly, rather than wait for a full-scale production of data and 
Monte Carlo. 

3.1. Transient data interface 

To address these problems, a transient data interface called the H1Calculator was designed and
implemented [5, 6], as shown in Figure [4]. This reads the persistent data provided by the H1Tree,
and stores transient derived quantities. The latest calibration can then be applied easily, and 
systematic uncertainty evaluation simply requires recalculating the derived quantities. All data 
access in user analysis code goes through this interface, guaranteeing a consistent treatment of 
the data. 
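
As an illustration of why a transient layer makes calibrations and systematic shifts cheap, the following Python sketch recomputes a derived quantity from a shifted base quantity without touching the persistent data. The class TransientCalculator, its method names, the event content and the choice of Q^2 from the scattered electron (Q^2 = 4 E E' cos^2(theta/2), with theta measured relative to the proton beam direction) are assumptions made for this example, not the actual H1Calculator interface.

```python
import math


class TransientCalculator:
    """Toy transient-data layer (hypothetical, not the H1Calculator API)."""

    def __init__(self, persistent, beam_energy, energy_scale=1.0):
        self._persistent = persistent      # stored quantities, never modified
        self._beam_energy = beam_energy    # lepton beam energy in GeV
        self._energy_scale = energy_scale  # run-time calibration factor
        self._shift = 0.0                  # relative systematic shift
        self._cache = {}                   # transient derived quantities

    def set_energy_shift(self, relative_shift):
        """Shift the base quantity and drop all cached derived values."""
        self._shift = relative_shift
        self._cache.clear()

    def electron_energy(self):
        """Calibrated (and possibly shifted) scattered electron energy."""
        if "energy" not in self._cache:
            raw = self._persistent["electron_energy"]
            self._cache["energy"] = raw * self._energy_scale * (1.0 + self._shift)
        return self._cache["energy"]

    def q2(self):
        """Q^2 = 4 E E' cos^2(theta/2), recomputed after any shift."""
        if "q2" not in self._cache:
            theta = self._persistent["electron_theta"]
            self._cache["q2"] = (4.0 * self._beam_energy * self.electron_energy()
                                 * math.cos(theta / 2.0) ** 2)
        return self._cache["q2"]


# Invented event: E' = 20 GeV, theta = 120 degrees, beam energy 27.6 GeV.
event = {"electron_energy": 20.0, "electron_theta": 2.0 * math.pi / 3.0}
calc = TransientCalculator(event, beam_energy=27.6)
nominal_q2 = calc.q2()
calc.set_energy_shift(0.01)  # +1% energy scale variation
shifted_q2 = calc.q2()
```

Because every derived value is cached only transiently and invalidated when the shift changes, a systematic variation amounts to one setter call followed by the usual accessors.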

In practice, the scope for derived quantities for all H1 analyses is very large, and the
H1Calculator is composed of several smaller, themed calculator classes which deal with specific
quantities, e.g. one for electron quantities, another for event kinematics. A generic interface to
the data provides access to variables by type (integer, TLorentzVector, etc.), which then allows
user classes to be decoupled from the details of this structure. The main H1Calculator class
itself then provides a simple interface to variables by integer ID (the generic interface) as well
as switches to apply calibrations and systematic shifts, as shown in Figure [5].
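
A minimal Python sketch of this composition is given here, assuming that each themed calculator declares which integer IDs it provides and with what type. All class names, the Var enumeration and the get_float/get_int accessors are invented for illustration; the real interface is not reproduced in this text.

```python
from enum import IntEnum


class Var(IntEnum):
    """Integer IDs for variables (invented for this sketch)."""
    ELECTRON_ENERGY = 0
    NUM_TRACKS = 1


class ElectronCalculator:
    provides = {Var.ELECTRON_ENERGY: float}

    def __init__(self, event):
        self._event = event

    def compute(self, var_id):
        return self._event["electron_energy"]


class TrackCalculator:
    provides = {Var.NUM_TRACKS: int}

    def __init__(self, event):
        self._event = event

    def compute(self, var_id):
        return len(self._event["tracks"])


class Calculator:
    """Routes requests by integer ID to the themed calculator that owns them."""

    def __init__(self, parts):
        self._owner = {var_id: part for part in parts for var_id in part.provides}

    def _get(self, var_id, expected_type):
        part = self._owner[var_id]
        if part.provides[var_id] is not expected_type:
            raise TypeError(f"{Var(var_id).name} is not of type {expected_type.__name__}")
        return part.compute(var_id)

    def get_float(self, var_id):
        return self._get(var_id, float)

    def get_int(self, var_id):
        return self._get(var_id, int)


event = {"electron_energy": 20.0, "tracks": ["t1", "t2", "t3"]}
calc = Calculator([ElectronCalculator(event), TrackCalculator(event)])
energy = calc.get_float(Var.ELECTRON_ENERGY)
n_tracks = calc.get_int(Var.NUM_TRACKS)
```

User classes in this sketch depend only on the integer ID and the expected type, so themed calculators can be added or reorganised without changing analysis code.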
Figure 5. A more detailed view of the transient data interface.
Figure 6. The event selector classes.



4. Higher Level Analysis Objects 

Following the generic interface to the transient data, more end-user classes could be defined. 
One particular highlight is the set of event selector classes [5], composed of lists of cut objects. A cut
object returns a boolean answer based either on a variable read from the transient data interface
or on a logical combination of other cut objects. This simple but very useful set of classes is
shown in Figure [6]. They also provide detailed debugging information and cutflow statistics.
These classes could be passed from analyst to analyst to apply particular event selections,
together with other simple classes responsible for histogram management [5], i.e. classes which
book and fill sets of (related) histograms. These simple organisational aids proved to be very
useful, not least in the context of data quality and software validation, as well as in physics 
analysis. 
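
The cut and selector idea can be sketched in a few lines of Python. The names Cut and EventSelector, the use of operators for logical combination, the dictionary-based events and the cut values are all assumptions made for this example; in the framework described here the variables would be read through the transient data interface instead.

```python
class Cut:
    """A named boolean decision on an event (hypothetical sketch)."""

    def __init__(self, name, predicate):
        self.name = name
        self._predicate = predicate

    def passes(self, event):
        return bool(self._predicate(event))

    # Logical combinations of cuts are themselves cuts.
    def __and__(self, other):
        return Cut(f"({self.name} AND {other.name})",
                   lambda e: self.passes(e) and other.passes(e))

    def __or__(self, other):
        return Cut(f"({self.name} OR {other.name})",
                   lambda e: self.passes(e) or other.passes(e))

    def __invert__(self):
        return Cut(f"NOT {self.name}", lambda e: not self.passes(e))


class EventSelector:
    """Applies an ordered list of cuts and records cutflow statistics."""

    def __init__(self, cuts):
        self.cuts = list(cuts)
        self.n_seen = 0
        self.cutflow = {cut.name: 0 for cut in self.cuts}

    def select(self, event):
        self.n_seen += 1
        for cut in self.cuts:
            if not cut.passes(event):
                return False
            self.cutflow[cut.name] += 1  # events surviving up to this cut
        return True


# Invented cut values and events.
high_q2 = Cut("q2 > 100", lambda e: e["q2"] > 100.0)
y_window = Cut("y > 0.05", lambda e: e["y"] > 0.05) & Cut("y < 0.6", lambda e: e["y"] < 0.6)
selector = EventSelector([high_q2, y_window])

events = [
    {"q2": 150.0, "y": 0.3},
    {"q2": 50.0, "y": 0.3},
    {"q2": 200.0, "y": 0.7},
    {"q2": 300.0, "y": 0.1},
]
selected = [e for e in events if selector.select(e)]
```

Since a selector in this sketch is just an object holding a list of cuts, it can be handed from one analyst to another, and its cutflow dictionary gives the number of events surviving each successive cut.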



4.1. Analysis objects

Figure [7] shows a flow diagram of a simplified physics analysis. Identifying the H1Calculator
as "the event", the selector and histogram manager objects are also evident. Two other 
organisational structures can also be seen, namely at the level of looping over one set of data or 
Monte Carlo, where we used the term "chain" from PAW, and finally the analysis level itself. 

Correspondingly, Analysis and AnalysisChain objects also proved to be useful organisational 
classes [5]. An Analysis object is composed of several AnalysisChains, which each contain one
or more Histogram Managers. The Event Selector is an object defined at the Analysis level.
Data access goes through the H1Calculator. This simple model is shown in Figure [8] and allows
a common framework for nearly all stages of a physics analysis. This in turn allowed better 
collaboration between physics groups, and the easy exchange of high-level analysis code. 
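
The composition described above can be sketched as follows in Python. The class layout mirrors the text (an Analysis owns the event selector and several chains, and each chain owns its histogram managers), but the method names, the toy fixed-width histogramming and the dictionary-based events are assumptions for illustration only; in particular, data access here bypasses any calculator layer.

```python
class HistogramManager:
    """Books and fills a set of related 1D histograms (toy fixed-width bins)."""

    def __init__(self, specs):
        # specs: {variable name: (number of bins, low edge, high edge)}
        self._specs = dict(specs)
        self.counts = {name: [0] * n for name, (n, _, _) in self._specs.items()}

    def fill(self, event):
        for name, (n, low, high) in self._specs.items():
            value = event[name]
            if low <= value < high:  # values outside the range are ignored
                index = int((value - low) / (high - low) * n)
                self.counts[name][index] += 1


class AnalysisChain:
    """Loops over one set of data or Monte Carlo events."""

    def __init__(self, name, events, managers):
        self.name = name
        self.events = list(events)
        self.managers = list(managers)

    def run(self, selector):
        n_selected = 0
        for event in self.events:
            if selector(event):
                n_selected += 1
                for manager in self.managers:
                    manager.fill(event)
        return n_selected


class Analysis:
    """Owns the event selector and applies it to every chain."""

    def __init__(self, selector, chains):
        self.selector = selector
        self.chains = list(chains)

    def run(self):
        return {chain.name: chain.run(self.selector) for chain in self.chains}


# Invented inputs: one data chain and one Monte Carlo chain.
data_hists = HistogramManager({"q2": (4, 100.0, 500.0)})
mc_hists = HistogramManager({"q2": (4, 100.0, 500.0)})
analysis = Analysis(
    selector=lambda e: e["q2"] > 100.0,
    chains=[
        AnalysisChain("data", [{"q2": 150.0}, {"q2": 50.0}, {"q2": 250.0}], [data_hists]),
        AnalysisChain("mc", [{"q2": 120.0}, {"q2": 180.0}, {"q2": 90.0}, {"q2": 400.0}], [mc_hists]),
    ],
)
summary = analysis.run()
```

Defining the selector once at the Analysis level guarantees, in this sketch, that data and Monte Carlo chains are subject to exactly the same selection.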
Figure 7. Flow diagram of physics analysis.
Figure 8. Analysis object model.



5. Conclusions 

During the luminosity upgrade at the turn of the millennium, H1 decided to change their analysis
framework to an object-oriented paradigm, concentrating on the data storage model. This move
was very successful in unifying the data storage model and analysis formats of H1. The flexibility
to have a user-defined data storage layer proved crucial in several analyses.

The development of a transient data interface improved physics analysis in many cases, 
especially those where access to the latest calibrations was critical and/or complicated systematic 
effects had to be evaluated. It paved the way for further developments which allowed for a more 
efficient exchange of higher-level physics analysis tools, up to and including entire analyses. 